Introduction

My name is Sokona Mangane and I’m from Brooklyn, NY. I’m a senior at Bates College, majoring in Mathematics and minoring in Digital and Computational Studies. In conjunction with the Institute for a Racially Just, Inclusive, and Open STEM Education (RIOS Institute), I am conducting a computational text analysis of STEM Open Education Resources (OER). In particular, I’m analyzing the “Inclusive Teaching” section descriptions on CourseSource, an open-access, peer-reviewed journal that publishes lessons, teaching content, and resources related to biology and physics. In the words of Dr. Carrie Diaz Eaton (an Associate Professor of Digital and Computational Studies at Bates College and a co-founder of QUBES, known for her work in social justice in STEM higher education), it’s like GitHub, but for curriculum.

When publishing an article on CourseSource (articles can be categorized as a “Lesson”, “Science Behind the Lesson”, “Teaching Tools and Strategies”, “Essay”, or “Review”), authors can describe how the article is inclusive under the “Inclusive Teaching” section; however, there are currently no guidelines for this section. Thus, this text analysis of OER submissions serves to answer what people write about under Inclusive Teaching.

Setup/Data Cleaning

Here, I’ve imported the data set (a CSV export of the spreadsheet) and the packages necessary for analysis. I also did some data cleaning, created a vector of DEI-related words, and added some variables to the original data set.

#cmd + shift + c to comment out code 
#cmd + shift + M to print %>% pipe operator
#cmd + return to run code 
# install.packages("varhandle")
# install.packages("skimr")
# install.packages("tidyverse")
# install.packages("tidytext")
 # install.packages("stopwords")
# install.packages("wordcloud")
# install.packages("reshape2")
# install.packages("ggraph")
#install.packages("kableExtra")

#loading necessary packages
library(varhandle)
library(ggraph)
library(igraph)
library(skimr)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(readr)
library(stopwords)
library(wordcloud)
library(reshape2)
library(kableExtra)

#importing dataset and DEI words list
rios_data <- read_csv("RIOS Research - Course Source - Sheet1 2.csv")
dei_keywords <- read_csv("SJEDI_words 2022-12-20 18_03_42.csv")

#updating human error for one article
rios_data$`Inclusive Teaching  included?`[12] = "No"

#arranging years
rios_data <- rios_data %>%
  arrange(desc(Year))

# creating a new column article number, to number each article (most recent article: 286)
rios_data$article_num <- c(nrow(rios_data):1)
# I've created a variable which contains diversity related words (words pulled from the keywords column) and then combined it with the `dei_keywords` dataframe I imported (Thank you Dr. Diaz-Eaton). I also added another column, which includes the article number for each row.

# diversity_related <- c("diversity", "bias", "confirmation bias", "cognitive bias", "social justice", "broader impacts", "racism", "identity", "equity", "inclusivity", "environmental justice", "inclusion", "belonging")
# 
# #adding the vector above to the CSV dei_keywords
# for (x in 1:13){
#   dei_keywords[nrow(dei_keywords) + 1,] = diversity_related[x]
# }

Here, each word from the Inclusive Teaching description and the keyword themes is “un-nested” into its own row, and any unnecessary punctuation, numbers, and stopwords are removed. I saved each of these in its own new dataframe for analysis.

rios_data_tokenizedit <- rios_data %>%
  unnest_tokens(output = inclusive_teach_tokens, input = `Inclusive Teaching Description`)


#removing all rows with any punctuation, digits, or "stopwords" (~20k rows total)
strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
stopwords_vec <- stopwords(language = "en")
stopwords_vec <- stopwords_vec[-c(165:167)]

#removed ~777 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
  filter(!str_detect(inclusive_teach_tokens, paste(strings, collapse = "|")))

#removed ~19,663 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
  filter(!inclusive_teach_tokens %in% stopwords_vec) 
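As a sanity check on what this cleaning pipeline does, here is a minimal base-R sketch on a made-up sentence (illustrative only; the actual analysis uses `unnest_tokens()` and `stopwords("en")` as above, and the stopword list here is a tiny stand-in):

```r
# Toy illustration of tokenization + stopword removal in base R
description <- "This lesson engages ALL students, regardless of background."

# lowercase, strip punctuation/digits, split on whitespace -> one token per element
tokens <- strsplit(tolower(gsub("[[:punct:][:digit:]]", "", description)), "\\s+")[[1]]

# drop a few common English stopwords (a tiny stand-in for stopwords("en"))
toy_stopwords <- c("this", "all", "of", "the", "a", "an")
tokens <- tokens[!tokens %in% toy_stopwords]

print(tokens)
# "lesson" "engages" "students" "regardless" "background"
```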

#doing same thing as above but for keyword themes
# rios_data_tokenizedkt <- rios_data %>%
#   unnest_tokens(output = keyword_themes_tokens, input = `keyword themes`)


#removing all rows with any punctuation, digits, or "stopwords" (78 rows total)
# strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
# stopwords_vec <- stopwords(language = "en")

#removed ~777 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
#   filter(!str_detect(keyword_themes_tokens, paste(strings, collapse = "|")))

#removed ~19,663 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
#   filter(!keyword_themes_tokens %in% stopwords_vec) 

Based on the work above, I wrote the data frame of all distinct ‘cleaned’ words from the Inclusive Teaching section out to a CSV and manually verified whether each of the 4,464 words should be counted as JEDI (looking at the context of the words as needed). After manual verification, I imported it back into R.

#allwords <- unique(rios_data_tokenizedit$inclusive_teach_tokens)
#uniquedeirelated <- sapply(allwords, function(x) any(sapply(dei_keywords, str_detect, string = x)))

#uniquedei <- cbind(allwords,uniquedeirelated)

#write_csv(as.data.frame(uniquedeirelated), "DEIRelated.csv")


#importing manually verified list of JEDI words 
JEDI_keywords_df <- read_csv("cleanedITwords - cleanedITwords.csv")

JEDI_keywords <- JEDI_keywords_df %>% 
  filter(Carrie == "JEDI") %>% 
  select(1)

A dei_related column was created for each data frame, which is TRUE if that word (from the Inclusive Teaching descriptions or the keyword themes) matches any word from the JEDI_keywords dataframe.

#creating a DEI related column
rios_data_tokenizedit$dei_relatedit = NA

# rios_data_tokenizedkt$dei_relatedkt = NA

rios_data_tokenizedit$dei_relatedit <- sapply(rios_data_tokenizedit$inclusive_teach_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))

# rios_data_tokenizedkt$dei_relatedkt <- sapply(rios_data_tokenizedkt$keyword_themes_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
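Note that the flagging above marks a token TRUE when any keyword is detected anywhere inside it, so a keyword stem can match several word forms. A small base-R sketch of that substring-matching behavior, using `grepl()` in place of `str_detect()` and hypothetical keyword stems:

```r
toy_keywords <- c("divers", "inclus", "equit")   # hypothetical keyword stems

# TRUE if ANY keyword occurs anywhere inside the token (substring match,
# mirroring the str_detect() call above, not exact whole-word matching)
is_dei <- function(token) any(sapply(toy_keywords, grepl, x = token, fixed = TRUE))

is_dei("diversity")   # TRUE ("divers" is a substring)
is_dei("equitable")   # TRUE
is_dei("biology")     # FALSE
```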

#save(dei_keywords, file = "dei_keywords.csv")
#saveRDS(rios_data_tokenized, file = "rios_data_tokenized.rds")

#removing the unnecessary columns
rios_data_tokenizedit <- rios_data_tokenizedit[,-c(9:13)]
# rios_data_tokenizedkt <- rios_data_tokenizedkt[,-c(9:13)]

Exploratory Data Analysis: Word Count

Word Count of Inclusive Teaching Text Over Time Box Plot

The boxplot below visualizes the word count of the Inclusive Teaching section over time. Word counts increase in 2019 compared to the years prior, and we also start to see more outliers. Overall, since the creation of CourseSource, the word count of the Inclusive Teaching section has increased.

rios_data$Year <- factor(rios_data$Year , levels=c("2022", "2021", "2020", "2019", "2018", "2017", "2016", "2015", "2014"))

  boxplot(`Word Count of Inclusive Teaching?`~ Year,
          data=rios_data,
          main="Word Count of Inclusive Teaching Sections Over Time",
          ylab="Year",
          xlab="Word count of Inclusive Teaching Section",
          horizontal = TRUE)

rios_data$Year <-  unfactor(rios_data$Year)

Presented below is an in-depth look at what’s visualized above.

rios_data %>% 
  group_by(Year) %>% 
  skim(starts_with("Word Count")) %>% 
  select(3,4,6:13)  %>% 
  mutate(numeric.mean = round(numeric.mean, digits = 2), numeric.sd = round(numeric.sd, digits = 2)) %>% 
  rename("Mean" = "numeric.mean",
         "SD" = "numeric.sd",
         "Min" = "numeric.p0",
         "25 Q" = "numeric.p25",
         "Median" = "numeric.p50",
         "75 Q" = "numeric.p75",
         "Max" = "numeric.p100",
         "Histogram" = "numeric.hist") %>% 
  kable() %>% 
  kable_minimal()
Year n_missing Mean SD Min 25 Q Median 75 Q Max Histogram
2014 4 106.85 58.05 34 63.00 90.0 133.00 230 ▇▇▆▁▃
2015 4 122.57 61.70 43 70.50 116.0 174.25 228 ▇▅▃▂▅
2016 3 115.80 89.83 26 79.00 103.0 127.50 453 ▇▅▁▁▁
2017 2 123.00 56.55 37 83.50 107.0 154.00 238 ▃▇▃▃▂
2018 5 124.70 80.22 34 89.75 95.0 144.75 324 ▆▇▃▁▂
2019 4 173.33 114.77 25 98.75 141.5 213.75 483 ▆▇▃▁▂
2020 2 249.45 218.46 43 126.50 203.0 276.00 1415 ▇▂▁▁▁
2021 5 224.62 170.73 43 125.75 169.0 241.75 901 ▇▃▁▁▁
2022 1 210.74 118.28 41 124.00 183.0 249.50 565 ▅▇▂▁▁

Exploratory Data Analysis: Word Frequency

What are the most common “DEI” Words in the Inclusive Teaching Description?

Only 2.6% of the distinct words in the Inclusive Teaching text are DEI related (118/4,464). Looking at the most common DEI words gives us an idea of which DEI words are used the most, and what that tells us about how the authors are being inclusive. According to the table below, the words “inclusive”, “diversity”, and “diverse” are the most common “DEI” words. This makes sense, as inclusive teaching should be diverse and cater to a diversity of racial backgrounds. Out of the 118 “DEI” words used in the Inclusive Teaching text, note that 70% are repeated more than once (83/118) and 54.2% are repeated more than twice (64/118). Based on these common words, it seems these articles try to be inclusive by being diverse, engaging, and catering to a diverse set of backgrounds and abilities.

However, the section these descriptions appear under is titled “Inclusive Teaching”, so one could write a lengthy description without using any of the words from dei_keywords and still mention “inclusive teaching” to be included in this category; thus, these numbers may be an underestimate. Below (in the 2-gram analysis) you can find a data frame, rios_2w_count, that shows the most common two-word DEI phrases.
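The percentages quoted above are simple ratios; as a quick arithmetic check (using the counts reported in the text):

```r
# Proportions reported above, computed from the counts given in the text
unique_words <- 4464   # distinct 'cleaned' words in the Inclusive Teaching text
dei_words    <- 118    # of those, manually verified as DEI related

round(100 * dei_words / unique_words, 1)   # 2.6  -> ~2.6% of words are DEI related
round(100 * 83 / dei_words, 1)             # 70.3 -> ~70% repeated more than once
round(100 * 64 / dei_words, 1)             # 54.2 -> ~54.2% repeated more than twice
```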

#BSS = BASIC SUMMARY STATISTICS

#most common DEI words, out of 118 (out of 4,464 words, 2.6% are DEI)
rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_teach_tokens, sort = TRUE)  
## # A tibble: 785 × 2
##    inclusive_teach_tokens     n
##    <chr>                  <int>
##  1 students                1389
##  2 inclusive                123
##  3 diversity                115
##  4 diverse                  100
##  5 opportunity               94
##  6 individual                72
##  7 engage                    66
##  8 environment               64
##  9 backgrounds               57
## 10 participate               57
## # ℹ 775 more rows

Word Cloud

Word clouds are another way of visualizing which words are being used the most. This word cloud shows the distinct words printed in the table above.

#ASK CARRIE WHY DOES WORDCLOUD LOOK SO 'EVEN'

rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_teach_tokens, sort = TRUE) %>% 
  with(wordcloud(inclusive_teach_tokens, n))

The tables and visuals above give us an idea of how often DEI words are used and what that says about the inclusivity of the articles. However, looking at the most commonly used DEI words doesn’t give us all the information on how an article is being inclusive or the authors’ definitions of inclusivity. Thus, I’ll repeat the analyses above, but looking at phrases, specifically of two words. Unlike above, I looked through all the phrases and removed what I felt was unnecessary and/or didn’t make sense. Below are the most common DEI phrases. Although some of the phrases aren’t repeated often, they show that the definition of inclusive teaching goes beyond just engaging all students.

Common DEI Phrases

2 words

rios_data_token2it <- rios_data %>%
  unnest_tokens(it_tokens_2w, `Inclusive Teaching Description`, token = "ngrams", n = 2)  %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_data_token2it <- rios_data_token2it %>%
  filter(!word1 %in% stopwords_vec) %>%
  filter(!word2 %in% stopwords_vec) %>% 
  unite(it_tokens_2w, word1, word2, sep = " ") 

rios_data_token2it <- rios_data_token2it %>%
  filter(!str_detect(it_tokens_2w, paste(strings, collapse = "|")))


#creating a DEI related column
rios_data_token2it$dei_related = NA


rios_data_token2it$dei_related <- sapply(rios_data_token2it$it_tokens_2w, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))


#removing unnecessary columns/rows [phrases that don't make sense or aren't related to DEI]
#however, keep in mind that in the context of a sentence a phrase probably makes sense, so I could be removing some important words

rios_data_token2it <- rios_data_token2it[,-c(9:13)]

unnecessary <- c("individually table", "individual investigators", "individual module", "individual noninteractive", "individual pre", "individual clicker", "inclusion another", "inclusion additionally", "identify species", "identify primer", "identify possible", "identified alternatively", "identification", "ideas individual", "group divide", "general inclusive", "bird identification", "yet inclusive", "variants identified", "teachers identify", "skills perspectives", "sheet individually", "residential birds", "radiation incidents", "regular individually", "pipeline cure", "plant communities", "popular culture", "perspective remind", "perspectives anonymous", "four individuals", "first collaborative", "find identify", "ever identified", "evenly divides", "ethnic economic", "ethnic given", "equity public", "england individual", "engaging final", "diverse face", "diverse natural", "diverse mixed", "direct connection", "disabilities benefit", "data individuals", "data individually", "communities due", "collaborative yet", "collaborative easing", "collaboration using", "collaboration throughout", "class individual", "biodiversiy lab", "biodiversity losses", "backgrounds may", "backgrounds find", "backgrounds furthermore", "backgrounds therefore", "area identified", "answer individually", "agricultural sciences", "ability train", "ability moreover", "abilities match", "questions individually", "individuals turn", "individuals since", "communicate collaborate", "backgrounds throughout", "access see", "de identified", "efficacy identity", "individualactors may")

rios_data_token2it <- rios_data_token2it %>%
  filter(!str_detect(it_tokens_2w, paste(unnecessary, collapse = "|")))

#for review (Naz?)
all2words <- as.data.frame(unique(rios_data_token2it$it_tokens_2w))
all2words$dei_related = NA
all2words$dei_related <- sapply(all2words$`unique(rios_data_token2it$it_tokens_2w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
write_csv(all2words, "2DEIRelated.csv")


#most common DEI words
rios_2w_count <- rios_data_token2it %>%
  filter(dei_related == "TRUE") %>%
  count(it_tokens_2w, sort = TRUE) 

write_csv(rios_2w_count, "rios2wcount.csv")


#graph of that 
rios_2w_count %>%
  top_n(30) %>%
  mutate(it_tokens_2w = reorder(it_tokens_2w, n)) %>%
  ggplot(aes(it_tokens_2w, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "(DEI Related) 2 Word Count in Inclusive Teaching Text") + 
  xlab(NULL)

3 words

I also did a 3-gram analysis, which has much lower frequencies. However, this gives us a better idea of what “inclusive teaching” means in these contexts.

## plot the frequency
rios_data_token3it <- rios_data %>%
  unnest_tokens(it_tokens_3w, `Inclusive Teaching Description`, token = "ngrams", n = 3)  %>%
  separate(it_tokens_3w, c("word1", "word2", "word3"), sep = " ")

rios_data_token3it <- rios_data_token3it %>%
  filter(!word1 %in% stopwords_vec) %>%
  filter(!word2 %in% stopwords_vec) %>%
  filter(!word3 %in% stopwords_vec) %>%
  unite(it_tokens_3w, word1, word2, word3, sep = " ") 

rios_data_token3it <- rios_data_token3it %>%
  filter(!str_detect(it_tokens_3w, paste(strings, collapse = "|")))


#creating a DEI related column
rios_data_token3it$dei_related = NA


rios_data_token3it$dei_related <- sapply(rios_data_token3it$it_tokens_3w, function(x) any(sapply(dei_keywords, str_detect, string = x)))


#removing the unnecessary columns
rios_data_token3it <- rios_data_token3it[,-c(9:13)]

#removing unnecessary content that doesn't make sense; keep in mind that as three words a phrase might not make sense or be related to inclusive teaching, but in the context of a sentence it probably makes sense

unnecessary2 <- c("academic background identity", "academic professional backgrounds", "individuals since questions", "inclusive teaching requires", "implements inclusive teaching", "many individuals since", "must communicate collaborate", "several inclusive teaching", "abilities experimental questions", "access discussion throughout", "access faculty access", "access sample work", "access see comments", "access several learning", "access software programs", "access species database", "accommodators perceived individual", "animal communities due", "american cultures science", "author backgrounds demonstrating", "background knowledge differently", "background therefore inexpensive", "backgrounds can benefit", "backgrounds find intellectual", "backgrounds may identify", "backgrounds needs learning", "backgrounds often find", "biodiversity database schools", "biodiversity lab report", "biodiversity losses amount", "biology species identified", "bird identification resources", "broader area identified", "butterfly activity engages", "campus based access", "class individual research", "class collaboration peer", "class individual research", "clicker questions individually", "coast students connected", "collaborative effort seeks", "completed courses backgrounds", "completing individual work", "connected device instructors", "connections may result", "contribute individual data", "cultured bacteria including", "cures increase access", "current biodiversity losses", "cyverse infrastructure connected", "de identified results", "different communities may", "disabilities additionally scientific", "discussing cellular diversity", "diverse demonstrators finally", "diversity among streptomyces", "diversity dna sequence", "diversity gap additionally", "divides labor among", "drift engaging students", "dynamics inclusive learning", "engagement student appropriate", "engages multiple senses", "engaging final presentation", "engaging multiple types", "engaging multiple week", 
"england individual investigators", "evenly divides labor", "friendly internet connected", "genetic drift engaging", "handle cultured samples", "handling cultured bacteria", "high impact active", "ideas individual writing", "identified asset maps", "identified instructors implementing", "identified personality types", "identify possible therapeutic", "identify primer binding", "inclusion another way", "inclusion standards can", "inclusion within many", "individual choice udl", "individual clicker questions", "individual self paced", "individual pre class", "individual submissions upon", "individual writing assignments", "ndividual writing peer", "individual written report", "individually completing separate", "individually first although", "individually group work", "individually providing time", "individually select groups", "individually selected research", "individuals students work", "instructor individual exploration", "instructor individual pre", "interesting biological cultural", "internet connected device", "investigate genetic diversity", "loxp lesson inclusive", "microorganisms impact human", "might impact human", "natural abilities match", "nuclear radiation incidents", "online bird identification", "open access species", "perspectives anonymous clicker", "plant communities useful", "plant identification http", "practice sheet individually", "prefer individual work", "project biodiversify website", "public biodiversity database", "purposes open access", "pursuing bird identification", "questions individual students", "questions individually discuss", "regular individually selected", "reliable internet connectivity", "revealing identifying information", "rgs encourage individual", "seafood traceability issues", "self identified asset", "self identified personality", "serpentine plant communities", "sheet individually providing", "significantly less connected", "simple model collaboration",  "site easily accessible", "spatial ability train", 
"species identified alternatively", "species identified instructors", "strategies including individual", "traceability issues overall", "videos students engage", "ways first collaborative", "ways individual clicker", "ways individual self", "whether current biodiversity")

#didn't include all of them here, but notice that inclusive teaching is synonmous with collaborative, working with other students (rows 123-144), being diverse, diversity (rows 198-237), and including ppl (inclusion, inclusive, including, etc.)

rios_data_token3it <- rios_data_token3it %>%
  filter(!str_detect(it_tokens_3w, paste(unnecessary2, collapse = "|")))


#for review (Naz?)
all3words <- as.data.frame(unique(rios_data_token3it$it_tokens_3w))
all3words$dei_related = NA
all3words$dei_related <- sapply(all3words$`unique(rios_data_token3it$it_tokens_3w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
write_csv(all3words, "3DEIRelated.csv")


#graph of the most common
rios_data_token3it %>%
  filter(dei_related == "TRUE") %>%
  count(it_tokens_3w, sort = TRUE) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(it_tokens_3w, n)) %>%
  ggplot(aes(it_tokens_3w, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "(DEI Related) 3 Word Count in Inclusive Teaching Text") + 
  xlab(NULL)

In Depth Bar Chart of Word Count By Year

This is a bar chart that looks more in depth at the frequency of words used each year. You can click on each tab to compare. Although 2014-2016 have higher proportions of DEI words, from 2018-2022 the top 20 DEI-related words are used more frequently; specifically, the words “diverse”, “diversity”, “engage”, “individual”, and “inclusive” are used more. We can see that the number and diversity of words increases each year (with decreases in 2018 and 2021).

# Below is just the code to print all 9 years into one table

#saving for visuals on word counts, etc
it_word_counts <- rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  group_by(Year) %>%
  count(inclusive_teach_tokens, sort = TRUE)


it_word_counts %>%
  # filter(Year != 2018) %>%
  filter(n > 1) %>% #41.58 of data
  ggplot(aes(inclusive_teach_tokens, n)) +
  geom_col() +
 #geom_text(aes(label = inclusive_teach_tokens), vjust = -0.5, size = 1, nudge_y = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~Year, ncol = 2) +
  ylim(0,33) +
  labs(title = "(DEI Related) Word Frequency Over Time", x = "(DEI Related) Word", y = "Word Count in Inclusive Teaching Text") +
   ## reduce spacing between labels and bars
  scale_x_discrete(expand = c(.01, .01)) +
  scale_fill_identity(guide = "none") +
  ## get rid of all elements except y axis labels + adjust plot margin +
  theme(axis.text.y = element_text(size = 14, hjust = 1, family = "Fira Sans"),
        plot.margin = margin(rep(15, 4)))

# it_word_counts %>%
#   ungroup(Year) %>%
#   count(n) %>%
#   mutate(percent = (nn/416)*100)

2014

it_word_counts %>%
  filter(Year == 2014) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2015

it_word_counts %>%
  filter(Year == 2015) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2016

it_word_counts %>%
  filter(Year == 2016) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2017

it_word_counts %>%
  filter(Year == 2017) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2018

it_word_counts %>%
  filter(Year == 2018) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2019

it_word_counts %>%
  filter(Year == 2019) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2020

it_word_counts %>%
  filter(Year == 2020) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2021

it_word_counts %>%
  filter(Year == 2021) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

2022

it_word_counts %>%
  filter(Year == 2022) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,33)

Normalized Word Frequency: The Most Distinctive Words By Year

This can help us see the “weight” of each word and which words are most distinctive for each year. Here we’re printing the idf, which is the “inverse document frequency, which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf [the term frequency and idf multiplied together], the frequency of a term adjusted for how rarely it is used. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites”.
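As a toy illustration of this calculation (made-up counts, not the CourseSource data; the graphs below use tidytext’s bind_tf_idf(), which computes the same quantities with a natural log):

```r
# tf-idf by hand for one word in a toy two-document corpus
# tf  = count of word in a document / total words in that document
# idf = log(number of documents / number of documents containing the word)
n_docs          <- 2
docs_with_word  <- 1      # the word appears in only one of the two documents
count_in_doc    <- 5      # occurrences of the word in that document
total_words_doc <- 100    # total words in that document

tf     <- count_in_doc / total_words_doc   # 0.05
idf    <- log(n_docs / docs_with_word)     # log(2) ~ 0.693
tf_idf <- tf * idf                         # ~0.0347

print(round(c(tf = tf, idf = idf, tf_idf = tf_idf), 4))
```

A word used often within one year but rarely in the other years gets a high tf-idf for that year, which is exactly what the per-year charts below highlight.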

For comparison purposes, the y-axis has the same limits for all the graphs. We can see that the words “cultured” and “disengaged” are the most distinctive words compared to the other DEI-related words for the rest of the years. These results make sense and align with the visuals above. Because the use of these words is very low in 2014 (besides “diversity” and “diverse”, which were seen more than 5 times in 2014), they have a high tf-idf statistic. These words have been used more each year.

2014

#finding the most distinctive words for each document
it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2014) %>%
  top_n(40) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2014", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2015

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2015) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2015", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2016

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2016) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2016", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2017

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2017) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2017", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2018

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2018) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2018", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2019

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2019) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2019", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2020

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2020) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2020", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2021

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2021) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2021", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

2022

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2022) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens,tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2022", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.095)

Network Plot of Word Relationship over Time

To get a deeper understanding of how inclusive teaching is viewed, we create a network plot to look at the relationships between words/phrases in the Inclusive Teaching section. The generated igraph object is called rios_phrase_network; it has 41 words and 36 connections among them. Similar to what some of the graphs above have portrayed, the words “inclusive”, “students”, and “diverse” are connected to many other words.
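As a rough sketch of what feeds the network: each two-word phrase is split into (word1, word2) and counted, giving a weighted edge list (toy phrases here; the real code below passes this kind of table to igraph::graph_from_data_frame()):

```r
# Toy sketch: turning two-word phrases into a weighted edge list in base R
phrases <- c("inclusive teaching", "diverse students", "inclusive teaching",
             "engage students")

# split each phrase into its two words -> one row per phrase occurrence
parts <- do.call(rbind, strsplit(phrases, " "))

# count occurrences of each (word1, word2) pair -> edge weights
edges <- aggregate(list(n = rep(1, nrow(parts))),
                   by = list(word1 = parts[, 1], word2 = parts[, 2]),
                   FUN = sum)
edges <- edges[order(-edges$n), ]

print(edges)   # "inclusive teaching" has weight 2, the other pairs weight 1
```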

2014

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2014) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2014")

2015

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2015) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2015")

2016

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2016) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2016")

2017

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2017) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2017")

2018

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2018) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2018")

2019

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2019) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2019")

2020

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2020) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2020")

2021

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2021) %>%
  count(word1, word2, sort = TRUE) %>%  
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2021")

2022

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2022) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1)  +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2022")
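The nine per-year chunks above repeat the same pipeline with only the year changing. As a sketch (assuming the `rios_data_token2it` bigram data frame created earlier, with its `it_tokens_2w`, `dei_related`, and `Year` columns), the repetition could be collapsed into one helper function:

```r
# Sketch of a helper that reproduces the repeated per-year network plot.
# Assumes rios_data_token2it exists as in the chunks above; the function
# name plot_phrase_network is hypothetical.
plot_phrase_network <- function(data, year, seed = 20181005) {
  phrase_network <- data %>%
    separate(it_tokens_2w, c("word1", "word2"), sep = " ") %>%   # split bigrams
    filter(dei_related == TRUE & Year == year) %>%               # keep DEI words for this year
    count(word1, word2, sort = TRUE) %>%                         # count word pairs
    graph_from_data_frame()                                      # build igraph object

  set.seed(seed)  # fixed seed so the "fr" layout is reproducible
  a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

  ggraph(phrase_network, layout = "fr") +
    geom_edge_link(aes(color = n, width = n), arrow = a) +
    geom_node_point() +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    labs(title = paste("Network Plot of (DEI Related) Word Relationship in", year))
}

# Example usage:
# plot_phrase_network(rios_data_token2it, 2016)
```

One call per year (or `purrr::map(2014:2022, ...)`) would then replace the duplicated chunks while keeping the output identical.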

References

CourseSource. QUBES. (n.d.). Retrieved October 2022, from https://qubeshub.org/community/groups/coursesource/

Dewsbury, B., & Brame, C. J. (2019). Inclusive teaching. CBE—Life Sciences Education, 18(2), 1–5. https://doi.org/10.1187/cbe.19-01-0021